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Generalized linear mixed-effects models in the context of genome- wide association studies (GWAS) 
represent a formidable computational challenge: tlie solution of millions of correlated generalized 
least-squares problems, and the processing of terabytes of data. We present high performance in- 
core and out-of-core shared-memory algorithms for GWAS: By taking advantage of domain-specific 
knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain 
unequalled performance. When compared to GenABEL, one of the most widely used libraries for 
GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable 
genome studies of unprecedented size. 
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analysis, Efficiency 
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1. INTRODUCTION 

Generalized linear mixed-effects models (GLMMs) are a type of statistical model 
widespread in many different disciplines such as genomics, econometrics, and social 
sciences [Teslovich at al. 2010; Antonio and Beirlant 2007; Gibbons at al. 2010]. 
Applications based on GLMMs face two computational challenges: the solution of 
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a sequence comprising millions of generalized least-squares problems (GLSs), and 
the processing of data sets so large that they only fit in secondary storage devices. 
In this paper, wc target the computation of GLMMs in the context of genome-wide 
association studies (GWAS). 

GWAS is the tool of choice to analyze the relationship between DNA sequence 
variations and complex traits such as diabetes and coronary heart diseases [Lauc 
et al. 2010; Levy et al. 2009; Speliotes et al. 2010]. More than 1400 papers published 
during the last five years endorse the relevance of GWAS [Hindorff et al. ]. The 
GLMM specific to GWAS solves the equation 

bi := {XlM-^Xi)-^XlM-^y, with l<i<m (1) 

where y is the vector of observations, representing a given trait or phenotype; Xi 
is the design matrix, including covariates and genome measurements; M represents 
dependencies among observations; and bi represents the relation between a variation 
in the genome sequence {Xi) and a variation in the trait (y). In linear algebra terms, 
Eq. (1) solves a linear regression with non-independent outcomes where 6j e TV, 
Xi e 7e"^P is full rank, M e 7e"^" is symmetric positive definite (SPD), and 
y G 7?."; the sizes are as follows: n « 10^, 1 < p < 20, and m, the length of 
the sequence, ranges from 10® to 10^. The quantities Xi, M, and y are known. 
Additionally, the Xi's present a special structure that will prove to be critical for 
performance: each Xi may be partitioned as {Xl \ Xr.), where Xl is the same for 
all X,'s. 

1.1 Limitations 

Computational biologists performing GWAS aim for the sizes described above; in a 
typical scenario, 3 Terabytes of data have to be processed through 3.6 x 10^^ arith- 
metic operations (Petaflops). In practice, current GWAS solvers are constrained to 
much smaller problems due to time limitations. For instance, in [Aulchenko et al. 
2010], the authors carry out a study that takes almost 4 hours for the following 
problem sizes: n = 1,500, p = 4, and m = 220,833. The time to perform the same 
study for m = 2.5 million is estimated to be roughly 43 hours. With our routines, 
the time to complete the latter reduces to 10 minutes. 

1.2 Terminology 

We collect and give a brief description of the acronyms used throughout the paper. 

— GWAS: Genome- Wide Association Studies 

— GLS: Generalized Least-Squares problems 

— GenABEL: One of the most widely used frameworks to perform GWAS 
— GWFGLS: GenABEL's state-of-the-art routine for the solution of Eq. (1) 
— HP- GWAS: our novel in-core solver for Eq. (1) 
— OOC-HP-GWAS: out-of-core version of HP-GWAS. 

Table I enumerates the BLAS (Basic Linear Algebra Subprograms) [Dongarra 
et al. 1990] and LAPACK (Linear Algebra PACKage) [Anderson et al. 1999] routines 
used in the algorithms presented in this paper. LAPACK and BLAS are the de-facto 
standard libraries for high-performance dense linear algebra computations. 
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Table I 



BLAS 1 and 2 



DOT 


Dot product 


T 

a := X y 


GEMV 


Matrix- vector product 


y:= Ax + y 


TRSV 


Triangular system with single right-hand side 


Ax = b 


BLAS 3 


GEMM 


Matrix-matrix product 


C:=AB + C 


SYRK 


Rank-k update 


C := A^A + C 


TRSM 


Triangular system with multiple right-hand sides 


AX = B 


LAPACK 


GETRI 


Inversion of a general matrix 




GESV 


General system with multiple right-hand sides 




POSV 


SPD system with multiple right-hand sides 




POTRF 


Cholesky factorization 




GELS 


Solution of a least-squares problem 




GGGLM 


Solution of a general Gauss-Markov linear model 





1.3 Related Work 

Traditionally, LAPACK is the tool of choice to develop high-performance algorithms 
and routines for linear algebra operations. Although LAPACK does not support 
the solution of a single GLS directly, it offers routines for closely related problems: 
GELS for least squares problems, and GGGLM for the general Gauss-Markov linear 
model. Algorithms 1 and 2 provide examples for the reduction of GLS problems to 
GELS and GGGLM, respectively. Unfortunately, none of the algorithms provided by 
LAPACK is able to exploit the sequence of GLSs within GWAS, nor the specific 
structure of its operands. Conversely, existing ad- hoc routines for Eq. (1), such 
as the widely used GWFGLS, are aware of the specific knowledge arising from the 
application, but exploit it in a sub-optimal way. 

LAPACK and GenABEL present additional drawbacks: LAPACK routines are 
in-core, i.e., data must fit in main memory; since GWAS may involve terabytes 
of data, it is in general not feasible to use these routines directly. Contrarily, 
GenABEL incorporates an out-of-core mechanism, but it suffers from significant 
overhead. 
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Algorithm 1: GLS problem reduced to GELS 



(potrf) 

(trsv) 

(trsm) 



l^^^^^^^p Algorithm 2: GLS problem reduced to GGGLM ^^^^^^^ 

1 LL'^ = M (potrf) 

2 b ■.^GGGLm{X , y , L) 

1.4 Contributions 

We present high-performance in-core and out-of-core algorithms, hp-GWAS and 
OOC-HP-GWAS, and their corresponding routines for the computation of GWAS 
on multi-threaded architectures. 

Our algorithms are optimized not for a single instance of the GLS problem but 
for the whole sequence of such problems. This is accompUshed by 

— breaking the black box structure of traditional libraries, which impose a separate 

routine call for each individual GLS, 
— exploiting domain-specific knowledge such as the particular structure of the 

operands, 

— grouping successive problems, allowing the use of high performance kernels at 

their full potential, and 
— organizing the computation to use multiple types of parallelism. 

When combined, these optimizations lead to an in-core routine that outperforms 
GenABEL's GWFGLS by a factor of 50. 

Additionally, we enable the solution of very large sequences of problems by incor- 
porating an efficient out-of-core mechanism to our in-core routine. Thanks to this 
extension, the out-of-core routine is capable of sustaining the high performance of 
the in-core one for data sets as large as the secondary storage. 

1.5 Organization 

Section 2 details, through a series of improvements, how hp-GWAS exploits both 
domain-specific knowledge and the BLAS library to attain high performance and 
scalability. In Section 3, we quantify the gain of each improvement and present a 
performance comparison between hp-GWAS and GWFGLS. Section 4 exposes the key 
ideas behind the out-of-core mechanism leading to OOC-hp-GWAS, which maintains 
HP-GWAS performance for very large sets of data. Out-of-core results are provided 
in Section 5. We discuss future work in Section 6, and draw conclusions in Section 7. 

2. IN-CORE ALGORITHM 

We commence the discussion by describing the incremental steps to transform a 
generic algorithm for the solution of a single GLS problem into a high-performance 
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algorithm that 1) solves a sequence of GLS problems, 2) exploits GWAS-specific 
knowledge, and 3) exploits multi-core parallelism. The resulting algorithm is then 
used in Section 4 as a starting point towards a high-performance out-of-core algo- 
rithm. 

Algorithm 3 solves a generic GLS problem. The approach consists in first reduc- 
ing the GLS to a linear least-squares problem (as shown in Algorithm 1), and then 
solving the associated normal equations {X'^ X)~^X'^y, where the coefficient ma- 
trix X g i?"^P is full rank and n > p. To this end, Algorithm 3 first factors M via a 
Cholesky factorization: LL^ = M; and then, it solves the systems X :— L~^X and 
y := L~^y. Several alternatives exist for the solution of the normal equations; for 
a detailed discussion we refer the reader to [Golub and Van Loan 1996; A. Bjorck 
1996]. Numerical considerations allow us to safely rely on the Cholesky factorization 
of the SPD matrix S := X'^ X without incurring instabilities. The algorithm com- 
pletes by computing b := X^y and solving the linear system b — S^^b. For each 
operation in the algorithm, we specify in brackets the corresponding BLAS/LA- 
PACK routine. 



Algorithm 3 (black-box): Solution of a GLS problem 



X := L-^X 



y:=L-^^ 



S := X^X 

b:^X^y 

b:=S-^b 



(potrf) 

(trsm) 

(trsv) 

( SYRK ) 

(gemv) 

(POSV) 



Algorithm 3 solves a single GLS problem. The algorithm may be used to solve 
a sequence of problems in a black box fashion, i.e., for each individual coefficient 
matrix AT^, use Algorithm 3 to solve the corresponding GLS problem. As the reader 
might have noticed, this approach leads to a considerable amount of redundant 
computation. We avoid the black box approach, and exploit the fact that we are 
solving a sequence of correlated problems. A closer look at Algorithm 3 reveals 
that, since only X varies from problem to problem, operations at lines 1 and 3 
may be performed once and reused across the sequence. The resulting Algorithm 4 
greatly reduces the computation performed using a black box approach. 



Algorithm 4 (seq-gls): Solution of a sequence of GLSs 

Li^ = M (potrf) 

y := L^^y (trsv) 
for each Xi 

Xi := L~^Xi (trsm) 

S^■.= XfX^ (SYRK) 

b,:^Xfy (gemv) 
b,:=Si\ (posv) 
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Stl 




Str 


= XTXn 


Sbl 


= XJ^Xl 


Sbr 


= ^R^R, 



^B = Xj^yBj 

Fig. 1: Computation of S := X"^ X, and b := X'^y in terms of the parts of X: {Xj^ \ Xn). L, R, 
T, and B, stand for Left, ijight, Top, and Bottom, respectively. 

Although Algorithm 4 already solves a sequence of GLS problems, it is still sub- 
optimal in a number of ways. The first crucial step towards high performance 
is a reorganization of the computation. A large percent of the computation in 
the loop is carried out by the TRSM operation at line 4. Even though trsm is 
a BLAS-3 operation, the fact that the system is solved for, at most, 20 right- 
hand sides does not allow TRSM to reach its peak performance; thus, the overall 
performance is affected. To overcome this limitation, we take advantage again from 
the sequence of problems: we group multiple trsms corresponding to successive 
problems L~^Xi into a larger trsm with enough right-hand sides to deliver its 
maximum performance, i.e., L~^X, where X represents the collection of all X'a: 
{Xi\X2\ ... 

As a further improvement, we focus on the knowledge specific to GWAS: the 
special structure of A". Each individual Xi may be partitioned in (X^ \ Xr.), where 
X]^ is fixed; thus the trsm operation L~^X may be split into two trsms: L~^X]^ 
and L~^Xji, where represents the collection of all Xj^'s: {Xj^^ \ Xji^ \ ■ ■ ■ \ Xr^ ). 
Additionally, the fact that Xl is fixed allows for more computation reuse: as shown 
in Fig. 1, the top left part of Si, and the top part of bi are also fixed. The resulting 
algorithm is assembled in Algorithm 5, HP-GWAS. 

^ Algorithm 5 (hp-GWAS): Solution of the GWAS-specific sequence of"GLSs^| 

1 Li^ = M ( POTRF ) 

2 Xl'—L^^Xl (trsm) 

3 Xr := L^^Xr (trsm) 

4 y := L^^y (trsv) 

5 Stl ■= XIXl (syrk) 

6 bx :— Xly (gemv) 

7 for each Xr. 

s Sbl, XJ^^Xl (gemv) 

9 Sbr, := X]^^Xr^ (dot) 

10 bB, ■■= X]^,y (dot) 
b,:=S:r^b, (POSv) 



2.1 GenABEL's gwfgls 

For completeness we provide in Algorithm 6 the algorithm implemented by Gen- 
ABEL's GWFGLS routine. The algorithm takes advantage from the specific structure 
of GWAS by computing lines 1 and 2 once, and reusing the results across the se- 
quence of problems. Unfortunately, a number of choices prevent it from attaining 
high performance: 

— the inversion of M (line 1) performs 6 times more computation than a Cholesky 

ACM Transactions on Mathematical Software, Vol. V, No. N, Month 20YY. 



Solving Sequences of GLS Problems on Multi-threaded Architectures • 7 







Computational cost 


Ratio Alg.# / Alg. 5 


Alg. 


3 (black-box) 


« n 


0{mn^) 


Alg. 


4 (seq-gls) 


Kip 


0{n^ + mn^p + mnp^) 


Alg. 


5 (hp-gwas) 


1 


0{tT' + mn^ + ranp) 


Alg. 


6 (gwfgls) 


« 2 


0{n^ + mn^ + mnp^) 



Table II: Asymptotic cost of each of the presented algorithms for GWAS. The ratio 
over HP-GWAS shows the progressive improvement made from the initial black box 
approach. hp-GWAS also improves the cost of GWFGLS by a constant factor. 



factorization of M, 

— line 2 breaks the symmetry of the expression {X'^ M^^X)^^ , which translates 
into doubling the amount of computation performed, 

— the BLAS-2 operation at line 4 (gemv) could be cast as a single BLAS-3 gemm 
involving all Xfi's (what we called X^. in Algorithm 5); GWFGLS does not include 
this improvement, thus it does not benefit from GEMm's high performance. 



Algorithm 6 (gwfgls): GenABEL's algorithm for GWAS 

M ^ M'^ (getri) 

Wl'.^XlM (gemm) 
for each Xft,. 

W'^^-.^X'^M (gemv) 

S^-^W^X, (gemm) 

bi := W'^y (gemv) 

hi := S^^h (gesv) 



2.2 Computational cost 

Table II includes the asymptotic cost of Algorithms 3-6 together with the ratio 
over our best algorithm, hp-GWAS. A discussion of the provided data follows. 

(1) The solution of a single GLS problem via Algorithm 3, black-box, has a 
computational cost of 0{n^). The solution of a sequence of such problems 
using this algorithm as a black box entails thus 0{mn^) flops, corresponding 
to the computation of m Cholesky factorizations. Clearly, this is not the best 
approach to solve a sequence of correlated problems: it performs n times more 
operations than hp-GWAS. 

(2) The key insight in Algorithm 4, SEQ-GLS, is to take advantage from the fact 
that we are solving not one but a sequence of correlated problems. Based on 
an analysis of dependencies, the algorithm breaks the rigidity of black-box, 
and rearranges the computation. As a result, the computational cost is reduced 
by a factor of n/p. Even though SEQ-GLS represent a great improvement with 
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respect to black-box, it is still not optimal for GWAS: a further reduction of 
redundant computation is possible. 

(3) HP-GWAS incorporates two further optimizations to overcome the limitations 
of SEQ-GLS. First, the algorithm exposes the structure of X, and the quan- 
tities computed from it, completely eliminating redundant operations. Then, 
the computation is carefully reorganized to exploit the full potential of the 
underlying libraries, resulting in an extremely efficient algorithm (see Fig. 2). 

(4) As for GWFGLS, it benefits from both the sequence of problems and the spe- 
cific structure of X. Unfortunately, the algorithm fails at exploiting the ex- 
isting symmetries, thus performing twice as much computation as HP-GWAS. 
Additionally, the algorithm is not properly designed to benefit from the highly- 
optimized BLAS library, having a negative impact on its performance. 

2.3 Parallelism 

HP-GWAS relies on a set of kernels provided by the highly-optimized BLAS and 
LAPACK libraries. In this situation, a straightforward approach to target multi- 
core architectures is to link the routine to a multi-threaded version of the libraries. 
While the first section of hp-GWAS (lines 1 to 6) benefits from this approach, show- 
ing high scalability, the second section (lines 7 to 11) does not scale. Therefore 
the weight of the second, although small in the sequential case, increases with the 
number of cores, affecting the overall scalability. To address this shortcoming we 
use a different parallelization scheme for the two sections: multi-threaded BLAS 
for lines 1 to 6, and OpenMP parallelism with single-threaded BLAS for lines 7 to 
11. As we show in the next section, the resulting routine is highly scalable. 

3. PERFORMANCE RESULTS (I) 

We turn now the attention towards the experimental results. We first report on 
timings for all four presented algorithms for the sequential case; the goal is to 
show and discuss the effect of the improvements described in the previous section. 
Then we focus on hp-GWAS and GWFGLS; we concentrate on timings for the multi- 
threaded versions of the routines and their scalability. 

3.1 Experimental setup 

As a computing environment we chose an architecture that we believe is readily 
available to most computational scientists. All four algorithms were implemented 
in C. Although GenABEL's interface is written in R, gwfgls and most of its 
routines are written in C. We ran all tests on a SMP system made of two Intel 
Xeon X5675 multi-core processors. Each processor has six cores, operating at a 
frequency of 3.06 GHz, for a combined peak performance of 146.88 GFlops/sec. 
The system is equipped with 32GB of RAM and 1TB of disk as secondary memory. 
We compiled the routines with the GNU C Compiler (gcc, version 4.4.5), and linked 
to a multi-threaded Intel's MKL library (version 10.3). hp-gwas also makes use 
of the OpenMP parallelism provided by the compiler through a number of pragma 
directives. 
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m 
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HP-GWAS 


— ♦— 


--V " 


GWFGLS 


o- 


SEQ-GLS 




BLACK-BOX 



Efficiency: 




HP-GWAS 


94% 


GWFGLS 


12% 


SEQ-GLS 


8% 


BLACK-BOX 


0% 



Fig. 2: Comparison of the four presented algorithms for GWAS. The gain of each 
individual improvement from BLACK-BOX to HP-GWAS is illustrated. Additionally, 
our best algorithm, HP-GWAS, outperforms state-of-the-art GWFGLS by a factor 
of 8. All experiments were performed using a single thread. The other problem 
dimensions are: n = 10,000, and p = 4. 



3.2 Results 

Fig. 2 shows the timings of all four algorithms for an increasing value of m, the 
number of GLS problems to be solved. The experiments were run using a single 
thread. The results for black-box exemplify the limitations of solving a sequence 
of correlated problems as if they are unrelated: no matter how optimized the algo- 
rithm is for a single instance, it cannot compete with algorithms specially tailored 
to solve the sequence as a whole. As a first step towards high performance, seq-gls 
reuses computation across the sequence of problems. Consequently, the algorithm 
reduces dramatically the execution time of the naive black-box approach, leading 
to a speedup greater than 250. 

HP-GWAS further reduces the execution time of SEQ-GLS by a factor of 12. The 
gain is explained by the effect of two optimizations. On the one hand, HP-GWAS 
exploits application-specific knowledge, the structure of X, leading to a speedup of 
p = 4 (larger values of p result in even larger speedups). On the other hand, the 
computation is reorganized taking into account high-performance considerations. 
It is a common misconception that every BLAS routine attains the same efficiency. 
However, due to architectural constraints such as memory hierarchy and associ- 
ated latency, BLAS-3 routines attain higher efficiency than BLAS-1 and BLAS-2. 
Therefore, rewriting multiple TRSVs (BLAS-2) as a single large trsm (BLAS-3), 
our algorithm achieves an extra speedup of 3. As shown in Fig. 2, hp-GWAS is an 
efficient algorithm to carry out GWAS; it attains 94% of the architecture's peak 
performance. 

Although GWFGLS is aware of the specific properties of GWAS and benefits from 
such knowledge, the algorithm suffers from inefficiencies similar to SEQ-GLS: it still 
performs redundant computation, and it is not properly tailored to benefit from 
BLAS-3 performance. The combination of both shortcomings results in a routine 
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1 2 4 6 8 10 12 
Number of threads 

Fig. 3: Scalability of hp-GWAS and GWFGLS. While GWFGLS' speedup plateaus at 
2, and the gain is minimal for more than 4 cores, hp-GWAS attains high-scalability 
and an even larger speedup is foreseen for a greater number of cores. The problem 
dimensions are: n = 10,000, p — 4:, and m — 100,000. 

that is 8 times slower than HP-GWAS. 

Henceforth, we concentrate on hp-GWAS and GWFGLS. In Fig. 3 we report on the 
scalability of both algorithms. As the figure reflects, while GWFGLS barely reaches a 
speedup of 2, completely stalling after 6 cores are used, hp-GWAS attains a speedup 
of almost 11 when using 12 cores. Most interestingly, the tendency clearly shows 
that larger speedups are expected for hp-GWAS when increasing the number of cores 
available. 

The disparity in the scalability of these two algorithms is mainly due to their use 
of the BLAS library. In the case of GWFGLS, the algorithm casts most of the com- 
putation in terms of the BLAS-2 operation GEMV, which, being a memory-bound 
operation, is limited not only in performance but also in scalability. Instead, as 
described earlier, hp-GWAS mainly builds on top of trsm (BLAS-3), which attains 
high scalability when operating on a large number of right-hand sides. 

We provide in Fig. 4 timings for both algorithms when using 12 threads. As 
expected, the speedup of hp-GWAS with respect to GWFGLS soars from 8 to 50. 

4. OUT-OF-CORE ALGORITHM 

So far, we have developed an algorithm for GWAS that overcomes the limitations 
of current approaches. It 
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Fig. 4: Timings for the multi-threaded versions of HP-GWAS and GWFGLS. Thanks 
to a much better scalabihty, our routine outperforms GWFGLS by a factor of 50. 
The experiments were run using 12 threads. The other problem dimensions are: 
n = 10,000, and p = 4. 



(1) solves a sequence of GLS problems, 

(2) exploits the available knowledge specific to GWAS, and 

(3) achieves high performance and scalability. 

However, the algorithm presents a critical limitation: data must fit in main mem- 
ory. The most common scenarios of GWAS require the processing of data sets that 
greatly exceed common main memory capacity: in a typical scenario, where 36 
millions of GLS problems arc to be solved with n = 10,000, the size of the input 
operand is roughly 3 terabytes. To overcome this limitation, we turn our atten- 
tion to out-of-core algorithms [Toledo 1999] . The goal is to design algorithms that 
make a proper use of available input/output (I/O) mechanisms to deal with data 
sets as large as the hard-drive size, while sustaining in-core high performance. 

We regard the solution of GWAS as a process that takes as input a large stream 
of data, corresponding to successive GLS problems, and generates as output a large 
stream of data corresponding to the solution of such problems; thus, it demands 
out-of-core algorithms that efficiently stream data from secondary storage to main 
memory and vice versa. 

We compare two approaches to data streaming. The first, used by GenABEL, is 
based on non-overlapping synchronous I/O; because of wait states, this approach 
introduces a considerable overhead in the execution time. The second, based on 
the well-known double-bufi^ering technique, allows the overlapping of I/O with com- 
putation; thanks to the overlapping, wait states, and the associated overhead, are 
reduced or even completely eliminated. 
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READ Xuk, 



COMP(X„fc, , y) 

WRITE buk, 
READ Xuk,+i 

WRITE biik,^^ 
READ Xtik,^, 

COMP(Xuki+^, y) 
WRITE buk,^. 



Fig. 5: Non-overlapping approacii to data streaming for out-of-core GWAS. I/O causes an overhead 
of 5% to 10%. 



Algorithm 7: Out-of-core algorithm for GWAS based on non-overlapping I/O 

LL'^ = M ( POTRF ) 

Xl:=L-^Xl (trsm) 
y L~^y (trsv) 
Stl-^XIXl (syrk) 
6t := X^y (gemv) 
for each Xuk in Xr 
read (Xbife) 

Xuk--=L-^Xuk (trsm) 
for each X/j^ in Xuk 

Sbl, ■■= X'^^Xl (gemv) 

SBRr-=xi^XR^ (dot) 

bBr-=Xl^y (dot) 
b,:=S^^h (posv) 
write (5h;fc) 



4.1 Non-overlapping approach 

The applieation of non-overlapping synchronous I/O to our in-core algorithm (hp- 
GWAS) results in Algorithm 7. The algorithm first computes the operations common 
to every GLS problem (Hues 1 to 5) and then iterates over the stream of X^'s (lines 
6 to 14). At each iteration, the following actions are performed: 

(1) read the Xr^s for a block of successive GLS problems, 

(2) compute the solutions, 5's, of such problems, and 

(3) write the 6's to disk. 
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Iteration i 

> Store b[i-l] 
>LoadXR[i+l] 

Iteration i + 1 

► Compute b 



r- Main memory ^ 









Workspace 1 



XR 





- | Workspace 2 
XR 




Rok^^e^hange 



Workspace 2 



XR 



Iteration i 
• Compute b [i] 



Iteration i + 1 
» Store b[i] 
> Load XR[i+2] 



Fig. 6: Workspaces for double buffering. The main memory is divided, from left 
to right, in global data, workspace 1, and workspace 2. Initially, workspace 1 is 
used for I/O and workspace 2 is used for computation. After each iteration, the 
workspaces exchange roles. 



Both I/O requests (lines 7 and 14) are synchronous: after the requests are issued, 
the processor enters a wait state until the I/O transfer has completed. Fig. 5 de- 
picts this shortcoming: red (dark) regions represent computation stalls where the 
processor waits for data to be read or written; blue (light) regions represent actual 
computation. Since loading data from secondary memory is orders of magnitude 
slower than loading data from main memory, I/O operations introduce a consid- 
erable overhead that negatively impacts performance. For the scenario described 
above, in which n = 10,000, _p = 4, and m = 36,000,000, synchronous I/O applied 
to HP-GWAS causes a 5% to 10% overhead. 

4.2 Overlapping approach - Double buffering 

To put double buffering into practice, the main memory is split into two workspaces: 
one for downloading and uploading data and one for computation. Also, the data 
streams are divided into blocks such that they fit in the corresponding workspaces. 
While iterating over the blocks, the workspaces alternate their role, allowing the 
overlapping of I/O with computation, and reducing or even eliminating the overhead 
due to I/O. Specifically for GWAS, both workspaces are subdivided in individual 
buffers, one for each operand to be streamed, Xji and b. As illustrated in Fig. 6, at 
iteration i, results from the previous iteration are located in Workspacel: :b and 
input data for next iteration is to be loaded in Workspacel: :XR. Simultaneously, 
GLS problems corresponding to the current iteration i are computed and stored in 
Workspace2: :b. At iteration the workspaces exchange their role: Workspacel 
is used for computation, and Workspace2 is used for I/O. 

It remains to be addressed how the downloading and uploading of data is actually 
performed in the background while computation is being carried out. Our approach 
is based on the use of asynchronous libraries, which allow a process to request the 
prefetching of data needed for the next iteration: data is loaded in the background 
while the process carries out computation with the current data set. 

In Algorithm 8 we provide the out-of-core algorithm OOC-hp-GWAS that applies 
double-buffering to hp-GWAS. At each iteration i over the blocks of data (lines 6 
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to 16), the algorithm performs the following steps: 

(1) request the loading of the next block of input data (XR[i+l]), 

(2) wait, if necessary, for the current block of data (XR[i] ), 

(3) compute the current set of problems defined by the current set of data, 

(4) request the storage of current results (b[i]), and 

(5) wait, if necessary, until previous results are stored (b[i-l]). 

As illustrated in Fig. 7, a perfect overlapping of I/O with computation means 
that no I/O is exposed and no processor idles waiting for I/O operations. 



15 



Algorithm 8 (ooc-hp-GWAS): Out-of-core algorithm for GWAS based on double-buffering^ 

LL'^ = M (potrf) 
Xl:=L-^Xl (trsm) 
y := L^^y (trsv) 
Stl-^XIXl (syrk) 
br X^y (gemv) 
for each Xuk in Xu 

async_read ( next Xuk) 

wait (current Xuk) 

Xuk L-^Xbik (trsm) 

for each Xj^. in Xi^ik 

Sbu-=Xr,Xl (gemv) 

SBB.r-= x]1,Xr, (dot) 
bsr-^xl^y (dot) 

b,■.= Sl'^b^ (POSV) 

async-write ( current buk) 



16 wait ( previous buk) 



4.3 Sustaining in-core high performance 

A perfect overlapping is only one of two requirements for the out-of-core routine 
to sustain in-core high performance. The second is to ensure that the operations 
within the loop over the stream of data (lines 6 to 16) attain the same efhciency 
as in the in-core routine. Both requirements depend on the number of threads and 
the block size, i.e., the number of Xjis, loaded at each iteration. 

To completely eliminate the overhead due to data movement from disk to memory 
and vice versa, the following equation must hold: 

time (computation) > time(load) + time(store). 

The block size has to be large enough to ensure that for each iteration the time 
spent in computing is larger than the time spent on storing the previous results 
and loading data for the next iteration. Since the computation time varies with the 
number of threads, the block size needs to be adjusted accordingly. 

Although it may seem that the best approach to select a block size is to sim- 
ply maximize memory usage, the initial overhead must be taken into account: the 
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READ Xbik-. 



WRITE fofcft, 
READ Xbit' 



library 



Computation 



WRITE hbik,,^ r: 

READ X4«,^3 ^ 



COMP(Xblk„y) 



COMP(Xfc«, - 



Fig. 7: Overlapping approach to data streaming, based on an asynchronous library, for out-of-core 
GWAS. The figure depicts a perfect overlapping of I/O with computation. 



loading of the first block of data is not overlapped with computation. In systems 
equipped with large amounts of main memory, it is advised to initiate the com- 
putation with a small block size to avoid exposed I/O, and increase it after a few 
iterations. 



5. PERFORMANCE RESULTS (II) 

In this section, we focus on the experimental results for OOC-hp-GWAS, our out-of- 
core routine. To measure the performance of the incorporated out-of-core mecha- 
nism, we compare the timings with those of the in-core routine, previously shown 
in Fig. 4. We complete the picture with a comparison between the timings of our 
routines and those of the in-core and out-of-core implementations of GWFGLS. All 
routines were written in C, and the experiments were run in the same environment 
as Section 3. In addition, OOC-hp-GWAS utilizes the AIO (asynchronous input/out- 
put) library, available on UNIX systems as part of their standard libraries. 

In Fig. 8, we combine timings for both the in-core routine hp-GWAS and the 
out-of-core routine OOC-hp-GWAS. The in-core routine is used for problems whose 
data sets fit in main memory, and we switch to the out-of-core routine for larger 
problems. The vertical line indicates the size of the largest problem that can be 
solved in-core. The figure shows that, thanks to the double-buffering and an ap- 
propriate choice of the block size, OOC-hp-GWAS achieves a perfect overlapping of 
I/O with computation. As a consequence, OOC-HP-GWAS is able to sustain in-core 
performance for problems as large as the hard-drive size. 

In Table III, we collect timings for both our routines and GWFGLS in both in- 
core and out-of-core scenarios. The provided ratios confirm the impact of using 
a sub-optimal approach to out-of-core: While, as seen in Fig. 8, the overlapping 
I/O mechanism incorporated in OOC-hp-GWAS sustains in-core performance, the 
non-overlapping approach in GWFGLS results in a 10% to 15% overhead. As a 
consequence, the speedup over GWFGLS raises from 50 to 58. 
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Fig. 8: Our out-of-core routine, OOC-hp-GWAS, sustains in-core performance for 
problems as large as the available secondary storage of 1 terabyte. The vertical line 
indicates the size limit for the in-core routine. The results were obtained using 12 
threads. The other problem dimensions are: n = 10,000, and p = 4. We used a 
block size of 5,000 throughout. 



m = 


10,000 


50,000 


100,000 


500,000 


1,000,000 


5,000,000 


10,000,000 


GWFGLS 
*HP-GWAS 

GWFGLS 

Ratio ,„p_^,^,,. 


429.2 
10.9 
39.2 


2,072.5 
43.0 
48.1 


4,117.9 
82.6 
49.9 


24,065 
414 
58.1 


48,130 
816 
58.9 


240,650 
4,184 
57.5 


481,300 
8,343 
57.7 



Table III: Timings for GWFGLS and *HP-GWAS (hp-GWAS and OOC-hp-GWAS) for 
both in-core and out-of-core scenarios. The problem dimensions are: n = 10,000 
and p = 4, for an increasing value of m. The results were obtained using 12 threads. 
The timings are in seconds. The double vertical line separates timings for the in-core 
(left) and out-of-core (right) routines. The increase in speedup in the out-of-core 
case reflects the overhead introduced in GWFGLS due to a non-overlapping I/O. 



The largest tests presented, involving 10 millions of genetic markers (X^'s), took 
less than 2.5 hours with OOC-hp-GWAS. This means that a complete genome- 
wide scan of association between 36 millions of genetic markers in a population of 
10,000 individuals takes now slightly more than 8 hours, and is only limited by the 
availability of a large (and cheap) secondary storage device. 



6. FUTURE WORK 

As current biomedical research experiences a large boost in the amount of available 
genomic data, computational biologists are eager to solve problems of ever increas- 
ing size. Even though the time spent to perform genome-wide analysis is reduced 
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significantly thanks to the techniques presented in this paper, further speedups are 
required to satisfy future needs. 

Throughout the paper wc assumed, as is the case in current analyses, that the 
matrix M fits in main memory. In this context, a further reduction of execu- 
tion time can be achieved through distributed-memory architectures via an MPI 
parallclization: the processes would first perform the Cholcsky factorization of M 
redundantly, and then operate on distinct chunks of Xr. In addition, if GPUs with 
enough memory to host M were available, the routines could be further sped up by 
offloading the computation of the trsm (line 8 in Alg. 8) onto the devices. 

When instead M does not fit in main memory, one should rely on approaches 
based on out-of-core algorithms-by-blocks [Quintana-Ortf et al. 2009; Quintana- 
Ortf et al. 2012] and libraries such as ScaLAPACK [Blackford ct al. 1997] and 
Elemental [Poulson et al. 2012], respectively for shared-memory and distributed- 
memory architectures. 

7. CONCLUSIONS 

We tackled a problem, extremely common in bioinformatics, that requires both 
high-performance computing and storage. Neither general nor domain-specific li- 
braries provide a viable solution. Indeed, due to the expected execution time and 
storage requirements, it was believed that these problems could be solved exclu- 
sively with the aid of supercomputers. This paper instead demonstrates that a 
single multi-core node suffices. 

We presented high-performance algorithms, and their corresponding implemen- 
tations, for the solution of sequences of generalized least-squares problems (GLSs) 
in the context of genome-wide association studies (GWAS). When compared to 
the widely used gwfgls routine from the GenABEL package, our routines attain 
speedups larger than 50. 

Our routines are specifically tailored for multi-threaded architectures. We fol- 
lowed an incremental approach: starting from an algorithm to solve one single 
GLS, we detailed the steps towards a high-performance algorithm for GWAS. At 
each step, we identified the limitations of current existing libraries and tools, and 
described the key insight to overcome such limitations. 

First, we showed that no matter how optimized is a routine to solve a single GLS 
instance, it cannot possibly compete with tools specifically designed for GWAS: it is 
imperative to take advantage of the sequence of correlated problems. This discards 
the black-box approach of traditional libraries. 

Then, we identified GWFGLS' issues regarding efficiency, scalability, and data 
handling, and detailed how we addressed them. Taking advantage of problem sym- 
metries and application-specific knowledge, we were able to completely eliminate 
redundant computation. Next, we pointed out that even BLAS-3 kernels might 
suffer from low efficiency. A careful rearrangement of the operations leads to an 
efficiency of 94%. Combining two kinds of parallelism -a multi-threaded version of 
BLAS and OpenMP parallelism-, our in-core solver attains speedups close to 11 on 
12 cores. 

Finally, thanks to an adequate utilization of the double-buffering technique, al- 
lowing for a perfect overlapping of data transfers with computation, our out-of-core 
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routine not only inherits in-core efficiency and scalability, but it is also capable 
of sustaining the achieved high performance for problems as large as the available 

secondary storage. 

As an immediate result, our routines enable genome-wide association studies 
of unprecedented size and shift the limitation from computation time to size of 
secondary storage devices. 
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